Implementing multiple regression algorithms to predict house prices in California ¶
Libraries and data import ¶
Import libraries ¶
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Data import and basic explorations ¶
Import feature names
indices = pd.read_csv('cal_housing.domain', sep=".", header=None)
indices
| 0 | 1 | |
|---|---|---|
| 0 | longitude: continuous | NaN |
| 1 | latitude: continuous | NaN |
| 2 | housingMedianAge: continuous | NaN |
| 3 | totalRooms: continuous | NaN |
| 4 | totalBedrooms: continuous | NaN |
| 5 | population: continuous | NaN |
| 6 | households: continuous | NaN |
| 7 | medianIncome: continuous | NaN |
| 8 | medianHouseValue: continuous | NaN |
Remove unnecessary words from the names
indices[0]=indices[0].str.replace(': continuous','')
indices.drop([1], inplace=True, axis=1)
indices
| 0 | |
|---|---|
| 0 | longitude |
| 1 | latitude |
| 2 | housingMedianAge |
| 3 | totalRooms |
| 4 | totalBedrooms |
| 5 | population |
| 6 | households |
| 7 | medianIncome |
| 8 | medianHouseValue |
Convert the feature column to a list so it can be assigned as the column names of the main data
col_list=list(indices[0])
col_list
['longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms', 'population', 'households', 'medianIncome', 'medianHouseValue']
Import main data
data =pd.read_csv('cal_housing.data', sep=",")
data.columns=col_list
data.head()
| longitude | latitude | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome | medianHouseValue | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 |
| 1 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 |
| 2 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 |
| 4 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 |
DATA DESCRIPTION:
The data pertains to the houses found in a given California district, with summary statistics based on the 1990 census. The columns are as follows; their names are largely self-explanatory:
- longitude: a measure of how far west a house is; a higher value is farther west [°]
- latitude: a measure of how far north a house is; a higher value is farther north [°]
- housingMedianAge: median age of a house within a block; a lower number is a newer building [years]
- totalRooms: total number of rooms within a block
- totalBedrooms: total number of bedrooms within a block
- population: total number of people residing within a block
- households: total number of households (a group of people residing within a home unit) for a block
- medianIncome: median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k Dollar]
- medianHouseValue: median house value for households within a block (measured in US Dollars) [Dollar]
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20639 entries, 0 to 20638 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 longitude 20639 non-null float64 1 latitude 20639 non-null float64 2 housingMedianAge 20639 non-null float64 3 totalRooms 20639 non-null float64 4 totalBedrooms 20639 non-null float64 5 population 20639 non-null float64 6 households 20639 non-null float64 7 medianIncome 20639 non-null float64 8 medianHouseValue 20639 non-null float64 dtypes: float64(9) memory usage: 1.4 MB
There are 9 features in total, all numeric; "medianHouseValue" is our target feature for this project.
data.describe()
| longitude | latitude | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome | medianHouseValue | |
|---|---|---|---|---|---|---|---|---|---|
| count | 20639.000000 | 20639.000000 | 20639.000000 | 20639.000000 | 20639.000000 | 20639.000000 | 20639.000000 | 20639.000000 | 20639.000000 |
| mean | -119.569576 | 35.631753 | 28.638888 | 2635.848152 | 537.917825 | 1425.530210 | 499.557779 | 3.870455 | 206843.910122 |
| std | 2.003495 | 2.135947 | 12.585568 | 2181.633870 | 421.248495 | 1132.463507 | 382.330173 | 1.899615 | 115385.731702 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1448.000000 | 295.500000 | 787.000000 | 280.000000 | 2.563100 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534700 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.742850 | 264700.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Exploratory Data Analysis ¶
Since the dataset has only a few features, we can examine the relation between each independent feature and the dependent feature.
Examining the relation and distribution of each feature with house price ¶
for i in data.columns:
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=data[i], y=data["medianHouseValue"], data=data,
                    size="medianHouseValue", hue="medianHouseValue")
From the scatter plots above, we can conclude that:
- In the longitude range -123 to -117, most house prices lie between 100000 and 300000; between -121 and -119, California has very few houses in the higher price segment.
- Around latitudes 34 and 38, California has houses ranging from very low to very high prices, although the number of expensive houses is small.
- Most houses aged 10 to 40 years are priced between 100000 and 300000, while the oldest houses are 52 years old.
- Almost all houses (cheap and expensive alike) sit in blocks with 0 to 6000 total rooms.
- Most houses sit in blocks with 0 to 1500 bedrooms, spanning the full price range; blocks with more bedrooms are rare and priced between roughly 120000 and 350000.
- Almost all houses lie in areas with populations from 3 to 5000 people, spanning the minimum to maximum cost.
- For households, most houses (cheap and expensive) fall in the 1 to 1500 range.
- Median income has a roughly linear relation with house price.
sns.scatterplot(y=data["housingMedianAge"], x=data["totalRooms"], data=data, alpha=0.6)
<AxesSubplot:xlabel='totalRooms', ylabel='housingMedianAge'>
Looking at totalRooms vs. housingMedianAge, newer blocks (ages 1 to 40) tend to have more rooms, with the largest room counts appearing in blocks 1 to 10 years old.
data.columns
Index(['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
'totalBedrooms', 'population', 'households', 'medianIncome',
'medianHouseValue'],
dtype='object')
sns.scatterplot(x=data["population"], y=data["totalRooms"], data=data, alpha=0.6)
<AxesSubplot:xlabel='population', ylabel='totalRooms'>
From the diagram above, it is clear that population increases as totalRooms increases.
sns.scatterplot(y=data["population"], x=data["medianIncome"], data=data, alpha=0.6)
<AxesSubplot:xlabel='medianIncome', ylabel='population'>
A wide range of medianIncome values occurs where the population is very low.
Feature Engineering ¶
Splitting into input and output features ¶
x=data.drop(["medianHouseValue"], axis=1)
y=data["medianHouseValue"]
display(x)
display("-"*90)
display(y)
| longitude | latitude | housingMedianAge | totalRooms | totalBedrooms | population | households | medianIncome | |
|---|---|---|---|---|---|---|---|---|
| 0 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 |
| 1 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 |
| 2 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 |
| 3 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 |
| 4 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20634 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 |
| 20635 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 |
| 20636 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 |
| 20637 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 |
| 20638 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 |
20639 rows × 8 columns
'------------------------------------------------------------------------------------------'
0 358500.0
1 352100.0
2 341300.0
3 342200.0
4 269700.0
...
20634 78100.0
20635 77100.0
20636 92300.0
20637 84700.0
20638 89400.0
Name: medianHouseValue, Length: 20639, dtype: float64
Checking for null values ¶
data.isnull().sum()
longitude 0 latitude 0 housingMedianAge 0 totalRooms 0 totalBedrooms 0 population 0 households 0 medianIncome 0 medianHouseValue 0 dtype: int64
No null values are present in the dataset.
Checking for outliers ¶
x.plot(kind="box", subplots=True, layout=(3,3), figsize=(15,15));
Capping outliers with the IQR method (values beyond 1.5×IQR are clipped to the bounds rather than dropped)
for i in x.columns:
    Q3 = np.percentile(x[i], 75)
    Q1 = np.percentile(x[i], 25)
    IQR = Q3 - Q1
    UB = Q3 + (1.5 * IQR)
    LB = Q1 - (1.5 * IQR)
    x[i] = x[i].apply(lambda v: UB if v > UB else v)  # cap high outliers at the upper bound
    x[i] = x[i].apply(lambda v: LB if v < LB else v)  # cap low outliers at the lower bound
x.plot(kind="box", subplots=True, layout=(3,3), figsize=(15,15));
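The per-element lambdas above can be replaced by pandas' vectorized `clip`, which caps values at the IQR bounds in one call. A minimal sketch on a toy series (the numbers are purely illustrative):

```python
import numpy as np
import pandas as pd

# toy series with one obvious high outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

Q1, Q3 = np.percentile(s, [25, 75])
IQR = Q3 - Q1
LB, UB = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# cap everything outside [LB, UB] at the bounds
capped = s.clip(lower=LB, upper=UB)
```

Using `x[i] = x[i].clip(lower=LB, upper=UB)` inside the loop above would give identical results with less code.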
Checking distribution of output feature ¶
y.plot(kind='kde', figsize=(15,5));
y=np.log(y)
y.plot(kind='kde', figsize=(15,5));
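Because the target has been log-transformed, every model in the rest of this notebook predicts log-prices, so the reported MAE/MSE values are in log units rather than dollars. To read a prediction as a dollar value it must be mapped back with `np.exp`; a minimal round-trip sketch:

```python
import numpy as np

price = np.array([358500.0, 77100.0])   # sample medianHouseValue entries
log_price = np.log(price)               # the scale the models are trained on
restored = np.exp(log_price)            # back-transform of a "prediction"
```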
Feature Selection ¶
Checking feature importance ¶
from sklearn.ensemble import ExtraTreesRegressor
import matplotlib.pyplot as plt
model = ExtraTreesRegressor()
model.fit(x, y)
print(model.feature_importances_)
#plot graph of feature importances for better visualization
print("Plotting graph of feature importances for better visualization")
feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(x.shape[1]).plot(kind='barh')
plt.show()
print("_"*70)
#Getting importance of all features in descending order as a dataframe
print("Getting importance of all features in descending order as a dataframe")
print("_"*70)
feat_importances.sort_values(ascending=False)
feature_df=feat_importances.to_frame()
DF=feat_importances.reset_index().rename(columns={"index":"Features", 0:"Importance"})
DF=DF.sort_values(by=['Importance'], ascending=False)
display(DF)
[0.15839987 0.15336859 0.05993442 0.02961595 0.03500005 0.04416715 0.03364452 0.48586944] Plotting graph of feature importances for better visualization
______________________________________________________________________ Getting importance of all features in descending order as a dataframe ______________________________________________________________________
| Features | Importance | |
|---|---|---|
| 7 | medianIncome | 0.485869 |
| 0 | longitude | 0.158400 |
| 1 | latitude | 0.153369 |
| 2 | housingMedianAge | 0.059934 |
| 5 | population | 0.044167 |
| 4 | totalBedrooms | 0.035000 |
| 6 | households | 0.033645 |
| 3 | totalRooms | 0.029616 |
From the diagram above, medianIncome is by far the most important feature, but all the others also have some effect on the output.
Correlation and heatmap ¶
plt.figure(figsize = (12,12))
sns.heatmap(data.corr(), cmap="YlGnBu", annot=True)
plt.show()
So there are strong correlations between totalRooms and households, totalBedrooms and households, totalBedrooms and population, and households and population.
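Rather than reading the strongly correlated pairs off the heatmap by eye, they can be listed programmatically. A small sketch using a toy frame so it runs standalone (the 0.8 threshold is an arbitrary illustrative choice):

```python
import pandas as pd

# toy frame: a and b are perfectly correlated, c is unrelated
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 4, 6, 8],
                   "c": [5, 1, 4, 2]})

corr = df.corr().abs()
pairs = [(r, c, corr.loc[r, c])
         for i, r in enumerate(corr.columns)
         for c in corr.columns[i + 1:]
         if corr.loc[r, c] > 0.8]
```

The same loop run on `data.corr()` would surface the room/bedroom/population/household pairs noted above.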
drop_list=["totalRooms"]
x.drop(drop_list, axis=1, inplace=True)
x.head()
| longitude | latitude | housingMedianAge | totalBedrooms | population | households | medianIncome | |
|---|---|---|---|---|---|---|---|
| 0 | -122.22 | 37.86 | 21.0 | 1106.0 | 2401.0 | 1092.5 | 8.012475 |
| 1 | -122.24 | 37.85 | 52.0 | 190.0 | 496.0 | 177.0 | 7.257400 |
| 2 | -122.25 | 37.85 | 52.0 | 235.0 | 558.0 | 219.0 | 5.643100 |
| 3 | -122.25 | 37.85 | 52.0 | 280.0 | 565.0 | 259.0 | 3.846200 |
| 4 | -122.25 | 37.85 | 52.0 | 213.0 | 413.0 | 193.0 | 4.036800 |
Train test split ¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(x, y, train_size=0.70)
print("Shape of X train:",X_train.shape)
print("Shape of Y train:",y_train.shape)
print("Shape of X test:",X_test.shape)
print("Shape of Y test:",y_test.shape)
Shape of X train: (14447, 7) Shape of Y train: (14447,) Shape of X test: (6192, 7) Shape of Y test: (6192,)
Data scaling ¶
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
scale.fit(X_train)
x_train_scaled=scale.transform(X_train)
x_test_scaled=scale.transform(X_test)
x_train=pd.DataFrame(data=x_train_scaled, columns=x.columns)
x_test=pd.DataFrame(data=x_test_scaled, columns=x.columns)
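Fitting the scaler on the training set only and reusing it for the test set, as above, avoids data leakage. The same guarantee can be had more compactly by bundling the scaler and the model into a single sklearn `Pipeline`; a minimal sketch on synthetic data (not the housing data):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)

# fit() scales and trains in one step; predict()/score() reapply
# the same fitted scaler automatically
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LinearRegression())])
pipe.fit(X, y)
r2 = pipe.score(X, y)
```

A pipeline also plays nicely with `GridSearchCV`, since each CV fold then refits the scaler on its own training split.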
Model Training and Evaluation of models ¶
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def model_evaluate(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print('MAE is {}'.format(round(mae, 3)))
    print('MSE is {}'.format(round(mse, 3)))
    print('R2 score is {}'.format(round(r2, 3)))
Linear Regression ¶
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train, y_train)
y_pred_lr=lr.predict(x_test)
print("Scores of Linear Regression:")
model_evaluate(y_test,y_pred_lr)
print("Train Accuracy:",lr.score(x_train, y_train))
print("Test Accuracy:",lr.score(x_test, y_test))
Scores of Linear Regression: MAE is 0.245 MSE is 0.105 R2 score is 0.679 Train Accuracy: 0.6774389914905221 Test Accuracy: 0.6788214329625517
pickle.dump(lr, open('linear_regressor.pkl', 'wb'))
sns.regplot(x=y_test,y=y_pred_lr)
plt.show()
Decision Tree Regressor ¶
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
parameter={"max_depth" : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"criterion":["squared_error", "friedman_mse", "absolute_error"],
"splitter":["best", "random"]}
kf=KFold(n_splits=10)
grid_sv = GridSearchCV(dt, cv=kf, param_grid=parameter, scoring='neg_mean_absolute_error')
grid_sv.fit(x_train, y_train)
param_dict_dt=dict(grid_sv.best_params_)
DT=DecisionTreeRegressor(criterion=param_dict_dt["criterion"],
max_depth=param_dict_dt["max_depth"],
splitter= param_dict_dt["splitter"])
DT.fit(x_train, y_train)
y_pred_DT=DT.predict(x_test)
print("Scores of DT Regression:")
model_evaluate(y_test,y_pred_DT)
Scores of DT Regression: MAE is 0.2 MSE is 0.084 R2 score is 0.742
pickle.dump(DT, open('decision_tree_regressor.pkl', 'wb'))
sns.regplot(x=y_test,y=y_pred_DT)
plt.show()
print("Train Accuracy:",DT.score(x_train, y_train))
print("Test Accuracy:",DT.score(x_test, y_test))
Train Accuracy: 0.8659150698903723 Test Accuracy: 0.7421219449398168
Random Forest Regressor ¶
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
parameter={"max_depth" : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
"criterion":["squared_error", "absolute_error", "poisson"],
"max_features":["auto", "sqrt", "log2"],
"n_jobs":[10]}
kf=KFold(n_splits=10)
grid_rf = GridSearchCV(rf, cv=kf, param_grid=parameter, scoring='neg_mean_absolute_error')
grid_rf.fit(x_train, y_train)
param_dict_rf=dict(grid_rf.best_params_)
RF=RandomForestRegressor(criterion=param_dict_rf["criterion"],
max_depth=param_dict_rf["max_depth"],
max_features= param_dict_rf["max_features"])
RF.fit(x_train, y_train)
y_pred_RF=RF.predict(x_test)
print("Scores of RF Regression:")
model_evaluate(y_test,y_pred_RF)
Scores of RF Regression: MAE is 0.173 MSE is 0.061 R2 score is 0.814
pickle.dump(RF, open('random_forest_regressor.pkl', 'wb'))
sns.regplot(x=y_test,y=y_pred_RF)
plt.show()
print("Train Accuracy:",RF.score(x_train, y_train))
print("Test Accuracy:",RF.score(x_test, y_test))
Train Accuracy: 0.8941539893312356 Test Accuracy: 0.8139671771977441
KNN Regressor ¶
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
parameter={"n_neighbors" : [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
"weights":['uniform', 'distance'],
"algorithm":['auto', 'ball_tree', 'kd_tree', 'brute'],
"n_jobs":[10]}
kf=KFold(n_splits=10)
grid_knn = GridSearchCV(knn, cv=kf, param_grid=parameter, scoring='neg_mean_absolute_error')
grid_knn.fit(x_train, y_train)
param_dict_knn=dict(grid_knn.best_params_)
KNN=KNeighborsRegressor(n_neighbors=param_dict_knn["n_neighbors"],
weights=param_dict_knn["weights"],
algorithm= param_dict_knn["algorithm"])
KNN.fit(x_train, y_train)
y_pred_KNN=KNN.predict(x_test)
print("Scores of KNN Regression:")
model_evaluate(y_test,y_pred_KNN)
Scores of KNN Regression: MAE is 0.214 MSE is 0.084 R2 score is 0.743
pickle.dump(KNN, open('knn_regressor.pkl', 'wb'))
sns.regplot(x=y_test,y=y_pred_KNN)
plt.show()
print("Train Accuracy:",KNN.score(x_train, y_train))
print("Test Accuracy:",KNN.score(x_test, y_test))
Train Accuracy: 0.9999999999999554 Test Accuracy: 0.74263420062882
A train accuracy this close to 1 is expected if the grid search selected weights='distance': with distance weighting, each training point is its own zero-distance nearest neighbour, so the model reproduces the training targets almost exactly.
Gradient Boost Regressor ¶
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor()
parameter={"loss" : ['squared_error', 'absolute_error'],
"criterion":['squared_error', 'mse', 'mae'],
"max_depth":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"max_features":['auto', 'sqrt', 'log2']}
kf=KFold(n_splits=10)
grid_gb = GridSearchCV(gb, cv=kf, param_grid=parameter, scoring='neg_mean_absolute_error')
grid_gb.fit(x_train, y_train)
param_dict_gb=dict(grid_gb.best_params_)
GB=GradientBoostingRegressor(loss=param_dict_gb["loss"],
criterion=param_dict_gb["criterion"],
max_depth= param_dict_gb["max_depth"],
max_features=param_dict_gb["max_features"])
GB.fit(x_train, y_train)
y_pred_GB=GB.predict(x_test)
print("Scores of GB Regression:")
model_evaluate(y_test,y_pred_GB)
Scores of GB Regression: MAE is 0.156 MSE is 0.051 R2 score is 0.842
pickle.dump(GB, open('gradient_boost_regressor.pkl', 'wb'))
sns.regplot(x=y_test,y=y_pred_GB)
plt.show()
print("Train Accuracy:",GB.score(x_train, y_train))
print("Test Accuracy:",GB.score(x_test, y_test))
Train Accuracy: 0.9841539063509372 Test Accuracy: 0.8419704386777042
Creating a dataframe with the MAE, MSE and R2-score of all trained models for better comparison ¶
regressors = {
'Linear Regression' : lr,
'Decision Tree' : DT,
'Random Forest' : RF,
'K-nearest Neighbors' : KNN,
'Gradient Boost' : GB
}
results=pd.DataFrame(columns=['Train Score', 'Test Score', 'MAE', 'MSE', 'R2-score'])
for method, func in regressors.items():
    pred = func.predict(x_test)
    results.loc[method] = [
        func.score(x_train, y_train),
        func.score(x_test, y_test),
        np.round(mean_absolute_error(y_test, pred), 3),
        np.round(mean_squared_error(y_test, pred), 3),
        np.round(r2_score(y_test, pred), 3)
    ]
final_result=results.sort_values('R2-score',ascending=False).style.background_gradient(cmap='Greens',subset=['R2-score'])
display(final_result)
| Train Score | Test Score | MAE | MSE | R2-score | |
|---|---|---|---|---|---|
| Gradient Boost | 0.984154 | 0.841970 | 0.156000 | 0.051000 | 0.842000 |
| Random Forest | 0.894154 | 0.813967 | 0.173000 | 0.061000 | 0.814000 |
| K-nearest Neighbors | 1.000000 | 0.742634 | 0.214000 | 0.084000 | 0.743000 |
| Decision Tree | 0.865915 | 0.742122 | 0.200000 | 0.084000 | 0.742000 |
| Linear Regression | 0.677439 | 0.678821 | 0.245000 | 0.105000 | 0.679000 |
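Each trained model was saved with `pickle.dump` along the way, so any of them can be reloaded later for inference without retraining. A minimal round-trip sketch with a small stand-in model (the demo filename here is hypothetical; the notebook's real files are e.g. 'gradient_boost_regressor.pkl'):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# train a tiny stand-in model on points lying exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

# round-trip through pickle, as done for each regressor above
with open('linear_regressor_demo.pkl', 'wb') as f:   # hypothetical demo filename
    pickle.dump(model, f)
with open('linear_regressor_demo.pkl', 'rb') as f:
    loaded = pickle.load(f)

pred = loaded.predict(np.array([[3.0]]))
```

Remember that the notebook's models predict log-prices, so `np.exp` is needed to turn a loaded model's output back into dollars.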